R Tools for Exploratory Data Analysis: Tibbles, data manipulation and introduction to graphics
Background
What is the purpose of these notes?
- Provide an overview of:
  - tibbles,
  - data manipulation,
  - various graphics tricks.
- You don’t have to learn everything in this handout; you can use it as a cheat-sheet when you work on your own data. Still, it’s good to walk through it.
Installing and loading packages
Just like every other programming language you may be familiar with, R’s capabilities can be greatly extended by installing additional “packages” and “libraries”.
To install a package, use the install.packages() command. You’ll want to run the following commands to get the necessary packages for today’s lab:
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("knitr")
You only need to install packages once. Once they’re installed, you may use them by loading the libraries using the library() command. For today’s lab, you’ll want to run the following code
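The code chunk itself did not survive in this copy of the notes. Judging from the start-up message printed below and the later use of kable(), it was presumably along these lines (a reconstruction, not the original chunk):

```r
# Attach the core tidyverse packages (ggplot2, tibble, tidyr, readr,
# purrr, dplyr, stringr, forcats) in one call
library(tidyverse)

# knitr provides kable(), used later for nicely formatted tables
library(knitr)
```

The "Conflicts" part of the message is expected: it only reports that dplyr::filter() and dplyr::lag() now take precedence over the functions of the same name in the stats package.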
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.0.4 ✓ dplyr 1.0.3
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Context
As we learned in this week’s lectures, Exploratory Data Analysis (EDA) is a process, a state of mind, and for it you need a few tools and pro-tips. This handout provides some of those.
Getting started: birthwt dataset
We’re going to start by operating on the birthwt dataset from the MASS library.
Let’s get it loaded and see what we’re working with. Remember, loading the MASS library overrides certain tidyverse functions (for example, MASS::select() masks dplyr::select()). We don’t want that, so when we need something from MASS we’ll extract that dataset or function directly with the MASS:: prefix.
tibbles
- tibbles are nicer data frames
- You may find it more convenient to work with tibbles instead of data frames
- In particular, they have nicer and more informative default print settings
- The dplyr functions we’ve been using are very nice because they map tibbles to other tibbles.
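The chunk that produced the printout below is not shown in this copy. A minimal sketch, reusing the as_tibble(MASS::birthwt) call that appears later in these notes:

```r
library(tidyverse)

# Pull the dataset straight out of MASS without attaching the package,
# so that MASS functions do not mask their tidyverse namesakes
birthwt <- as_tibble(MASS::birthwt)

# Printing a tibble shows only the first 10 rows plus the column types
birthwt
```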
# A tibble: 189 x 10
low age lwt race smoke ptl ht ui ftv bwt
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 19 182 2 0 0 0 1 0 2523
2 0 33 155 3 0 0 0 0 3 2551
3 0 20 105 1 1 0 0 0 1 2557
4 0 21 108 1 1 0 0 1 2 2594
5 0 18 107 1 1 0 0 1 0 2600
6 0 21 124 3 0 0 0 0 0 2622
7 0 22 118 1 0 0 0 0 1 2637
8 0 17 103 3 0 0 0 0 1 2637
9 0 29 123 1 1 0 0 0 1 2663
10 0 26 113 1 1 0 0 0 0 2665
# … with 179 more rows
Note: If you want to import data directly into tibble format, you may use read_delim() and read_csv() instead of their base-R alternatives. Even though we started with the base alternatives, I recommend using these improved import commands going forward.
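To make the difference concrete, here is a small sketch that reads the same file both ways. The temporary file and its two rows are made up for illustration and are not part of the lab data.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,weight", "1,2523", "2,2551"), tmp)

base.df <- read.csv(tmp)   # base R: returns a plain data.frame
tidy.df <- read_csv(tmp)   # readr: returns a tibble

class(base.df)
class(tidy.df)
```

read_csv() also prints the column types it guessed, which is a handy early check that the import did what you expected.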
Renaming the variables
The dataset doesn’t come with very descriptive variable names.
Let’s get better column names (use help(birthwt, package = "MASS") to understand the variables and come up with better names).
[1] "low" "age" "lwt" "race" "smoke" "ptl" "ht" "ui" "ftv"
[10] "bwt"
# The default names are not very descriptive
colnames(birthwt) <- c("birthwt.below.2500", "mother.age",
"mother.weight", "race", "mother.smokes",
"previous.prem.labor", "hypertension",
"uterine.irr", "physician.visits", "birthwt.grams")
# Better names!
An alternative renaming approach: the rename() command
rename operates by allowing you to specify a new variable name for whichever old variable name you want to change.
# Reload the data again
birthwt <- as_tibble(MASS::birthwt)
birthwt <- birthwt %>%
rename(birthwt.below.2500 = low,
mother.age = age,
mother.weight = lwt,
mother.smokes = smoke,
previous.prem.labor = ptl,
hypertension = ht,
uterine.irr = ui,
physician.visits = ftv,
birthwt.grams = bwt)
colnames(birthwt)
[1] "birthwt.below.2500" "mother.age" "mother.weight"
[4] "race" "mother.smokes" "previous.prem.labor"
[7] "hypertension" "uterine.irr" "physician.visits"
[10] "birthwt.grams"
Note that in this command we didn’t rename the race variable because it already had a good name.
Renaming the factors
All the factors are currently represented as integers.
Let’s use the mutate(), mutate_at() and recode_factor() functions to convert variables to factors and give the factors more meaningful levels.
birthwt <- birthwt %>%
mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other")) %>%
mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"),
~ recode_factor(.x, `0` = "no", `1` = "yes"))
birthwt
# A tibble: 189 x 10
birthwt.below.2… mother.age mother.weight race mother.smokes
<fct> <int> <int> <fct> <fct>
1 no 19 182 black no
2 no 33 155 other no
3 no 20 105 white yes
4 no 21 108 white yes
5 no 18 107 white yes
6 no 21 124 other no
7 no 22 118 white no
8 no 17 103 other no
9 no 29 123 white yes
10 no 26 113 white yes
# … with 179 more rows, and 5 more variables: previous.prem.labor <int>,
# hypertension <fct>, uterine.irr <fct>, physician.visits <int>,
# birthwt.grams <int>
Recall that the syntax ~ recode_factor(.x, ...) defines an anonymous function that will be applied to every column specified in the first part of the mutate_at() call. In this case, all of the specified variables are binary 0/1 coded, and are being recoded to no/yes.
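In dplyr 1.0.0 and later, the same recoding can also be written with across() inside mutate(). The sketch below is an alternative, not the call used in these notes. To stay self-contained it works on the original, un-renamed column names, and the names binary.vars and birthwt.recoded are introduced here for illustration.

```r
library(tidyverse)

birthwt.raw <- as_tibble(MASS::birthwt)

# The four 0/1 coded columns, under their original MASS names
binary.vars <- c("low", "smoke", "ht", "ui")

# across() applies the anonymous function to each selected column
birthwt.recoded <- birthwt.raw %>%
  mutate(across(all_of(binary.vars),
                ~ recode_factor(.x, `0` = "no", `1` = "yes")))

levels(birthwt.recoded$smoke)
```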
Summary of the data
- Now that things are coded correctly, we can look at an overall summary
birthwt.below.2500 mother.age mother.weight race mother.smokes
no :130 Min. :14.00 Min. : 80.0 white:96 no :115
yes: 59 1st Qu.:19.00 1st Qu.:110.0 black:26 yes: 74
Median :23.00 Median :121.0 other:67
Mean :23.24 Mean :129.8
3rd Qu.:26.00 3rd Qu.:140.0
Max. :45.00 Max. :250.0
previous.prem.labor hypertension uterine.irr physician.visits birthwt.grams
Min. :0.0000 no :177 no :161 Min. :0.0000 Min. : 709
1st Qu.:0.0000 yes: 12 yes: 28 1st Qu.:0.0000 1st Qu.:2414
Median :0.0000 Median :0.0000 Median :2977
Mean :0.1958 Mean :0.7937 Mean :2945
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:3487
Max. :3.0000 Max. :6.0000 Max. :4990
A simple table
- Let’s use the summarize() and group_by() functions to see what the average birthweight looks like when broken down by race and smoking status. To make the printout nicer we’ll round to the nearest gram.
tbl.mean.bwt <- birthwt %>%
group_by(race, mother.smokes) %>%
  summarize(mean.birthwt = round(mean(birthwt.grams), 0))
`summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
tbl.mean.bwt
# A tibble: 6 x 3
# Groups: race [3]
race mother.smokes mean.birthwt
<fct> <fct> <dbl>
1 white no 3429
2 white yes 2827
3 black no 2854
4 black yes 2504
5 other no 2816
6 other yes 2757
- Questions you should be asking yourself:
- Does smoking status appear to have an effect on birth weight?
- Does the effect of smoking status appear to be consistent across racial groups?
- What is the association between race and birth weight?
A simple reshape
- Some of these questions might be easier if we had the data in a wide rather than a long format. Here’s how we can do that with the spread() function from tidyr
- The basic spread() call is spread(data, key, value)
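The chunk that produced the wide table below is missing from this copy. As a sketch, the table of means from the previous section is typed in by hand here so the example runs on its own. The spread() call matches the one used in the kable() example later on. The pivot_wider() call is an addition, shown because newer tidyr versions recommend it for the same reshape.

```r
library(tidyverse)

# Mean birthweights by race and smoking status, copied from the table above
tbl.means <- tribble(
  ~race,   ~mother.smokes, ~mean.birthwt,
  "white", "no",           3429,
  "white", "yes",          2827,
  "black", "no",           2854,
  "black", "yes",          2504,
  "other", "no",           2816,
  "other", "yes",          2757
)

# key   = column whose values become the new column names
# value = column whose values fill the cells
wide.spread <- spread(tbl.means, mother.smokes, mean.birthwt)

# The same reshape with the newer tidyr interface
wide.pivot <- pivot_wider(tbl.means,
                          names_from = mother.smokes,
                          values_from = mean.birthwt)
wide.pivot
```

Because race is stored as plain text in this hand-typed copy rather than as a factor, the row order of the two results may differ from the printout below.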
# A tibble: 3 x 3
# Groups: race [3]
race no yes
<fct> <dbl> <dbl>
1 white 3429 2827
2 black 2854 2504
3 other 2816 2757
What if we wanted nicer looking output?
- Let’s use the header {r, results='asis'}, along with the kable() function from the knitr library
# Save the table from before as a
# Print nicely
kable(spread(tbl.mean.bwt, mother.smokes, mean.birthwt),
      format = "markdown")

| race | no | yes |
|---|---|---|
| white | 3429 | 2827 |
| black | 2854 | 2504 |
| other | 2816 | 2757 |
kable() outputs the table in a way that Markdown can read and nicely display.
Note: changing the CSS changes the table appearance.
Example: Association between mother’s age and birth weight?
- Is the mother’s age correlated with birth weight?
[1] 0.09031781
- Does the correlation vary with smoking status?
# A tibble: 2 x 2
mother.smokes cor_bwt_age
* <fct> <dbl>
1 no 0.201
2 yes -0.144
- Does the association between birthweight and mother’s age vary by race?
# A tibble: 3 x 2
race cor_bwt_age
* <fct> <dbl>
1 white 0.166
2 black -0.329
3 other -0.0293
There does look to be variation, but we don’t know if it’s statistically significant without further investigation.
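The chunks behind the three correlation printouts above are not shown in this copy. A sketch of how they could be computed follows. The column name cor_bwt_age comes from the printed output; the object names cor.overall, cor.by.smoke and cor.by.race are introduced here for illustration. The data preparation is repeated in compact form so the example runs on its own.

```r
library(tidyverse)

birthwt.cor <- as_tibble(MASS::birthwt) %>%
  rename(mother.age = age, mother.smokes = smoke, birthwt.grams = bwt) %>%
  mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other"),
         mother.smokes = recode_factor(mother.smokes, `0` = "no", `1` = "yes"))

# Overall correlation between birth weight and mother's age
cor.overall <- with(birthwt.cor, cor(birthwt.grams, mother.age))

# Correlation within each smoking group
cor.by.smoke <- birthwt.cor %>%
  group_by(mother.smokes) %>%
  summarize(cor_bwt_age = cor(birthwt.grams, mother.age))

# Correlation within each racial group
cor.by.race <- birthwt.cor %>%
  group_by(race) %>%
  summarize(cor_bwt_age = cor(birthwt.grams, mother.age))

cor.overall
cor.by.smoke
cor.by.race
```

A formal check of whether these group differences are statistically significant would need a model or test beyond cor(), for example an interaction term in a regression.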
Graphics in R
We now know a lot about how to tabulate data
It’s often easier to look at plots instead of tables
We’ll now talk about some of the standard plotting options
Easier to do this in a live demo…
Please refer to .Rmd version of lecture notes for the graphics material
Standard graphics in R
Single-variable plots
Let’s continue with the birthwt data from the MASS library.
Here are some basic single-variable plots.
par(mfrow = c(2,2)) # Display plots in a single 2 x 2 figure
plot(birthwt$mother.age)
with(birthwt, hist(mother.age))
plot(birthwt$mother.smokes)
plot(birthwt$birthwt.grams)
Note that the result of calling plot(x, ...) varies depending on what x is.
- When x is numeric, you get a plot showing the value of x at every index.
- When x is a factor, you get a bar plot of counts for every level
Let’s add more information to the smoking bar plot, and also change the color by setting the col option.
par(mfrow = c(1,1))
plot(birthwt$mother.smokes,
main = "Mothers Who Smoked In Pregnancy",
xlab = "Smoking during pregnancy",
ylab = "Count of Mothers",
     col = "lightgrey")
(much) better graphics with ggplot2
Introduction to ggplot2
ggplot2 has a slightly steeper learning curve than the base graphics functions, but it also generally produces far better and more easily customizable graphics.
There are two basic calls in ggplot:
- qplot(x, y, ..., data): a “quick-plot” routine, which essentially replaces the base plot()
- ggplot(data, aes(x, y, ...), ...): defines a graphics object from which plots can be generated, along with aesthetic mappings that specify how variables are mapped to visual properties.
plot vs qplot
Here’s how the default scatterplots look in ggplot compared to the base graphics. We’ll illustrate things by continuing to use the birthwt data from the MASS library.
I’ve snuck the with() command into this example. with() allows you to use the variables in a data frame directly in evaluating the expression in the second argument.
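The comparison chunk is missing from this copy. A sketch of what it plausibly looked like, using the original MASS column names (age, bwt) so that it runs without the renaming step; the object name quick.plot is introduced here for illustration:

```r
library(ggplot2)

birthwt.raw <- MASS::birthwt

# Base graphics: with() lets us refer to columns without the birthwt.raw$ prefix
with(birthwt.raw, plot(age, bwt))

# ggplot2 quick-plot version of the same scatterplot
# (newer ggplot2 releases print a deprecation warning for qplot())
quick.plot <- qplot(x = age, y = bwt, data = birthwt.raw)
quick.plot
```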
Remember how it took us some effort last time to add color coding, use different plotting characters, and add a legend? Here’s the qplot call that does it all in one simple line.
qplot(x=mother.age, y=birthwt.grams, data=birthwt,
color = mother.smokes,
shape = mother.smokes,
xlab = "Mother's age (years)",
      ylab = "Baby's birthweight (grams)")
This way you won’t run into problems of accidentally producing the wrong legend. The legend is produced based on the colour and shape arguments that you pass in. (Note: color and colour have the same effect.)
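qplot() is deprecated as of ggplot2 3.4.0, so on a current installation it may be preferable to build the same plot with ggplot(). The following is a sketch of an equivalent call, not part of the original notes. The data preparation is repeated in compact form so the example runs on its own, and the names birthwt.plot and p.age are introduced for illustration.

```r
library(tidyverse)

birthwt.plot <- as_tibble(MASS::birthwt) %>%
  rename(mother.age = age, mother.smokes = smoke, birthwt.grams = bwt) %>%
  mutate(mother.smokes = recode_factor(mother.smokes, `0` = "no", `1` = "yes"))

# Mapping both colour and shape to the same factor yields a single merged legend
p.age <- ggplot(birthwt.plot,
                aes(x = mother.age, y = birthwt.grams,
                    colour = mother.smokes, shape = mother.smokes)) +
  geom_point() +
  xlab("Mother's age (years)") +
  ylab("Baby's birthweight (grams)")
p.age
```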
ggplot function
The ggplot2 library comes with a dataset called diamonds. Let’s look at it
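The printout below shows the dimensions and first six rows of the data, which suggests calls along these lines (a reconstruction, since the chunk is not shown in this copy):

```r
library(ggplot2)

dim(diamonds)    # number of rows and columns
head(diamonds)   # first six rows
```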
[1] 53940 10
# A tibble: 6 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
It is a data frame of 53,940 diamonds, recording their attributes such as carat, cut, color, clarity, and price.
We will make a scatterplot showing the price as a function of the carat (size). (The data set is large so the plot may take a few moments to generate.)
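The plotting chunk is missing from this copy. Based on the walkthrough that follows, it was presumably:

```r
library(ggplot2)

# Define the graphics object: the data plus the aesthetic mappings
diamond.plot <- ggplot(data = diamonds, aes(x = carat, y = price))

# Add a layer of points to turn it into a scatterplot
diamond.plot + geom_point()
```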
The plot looks a little weird because a lot of diamonds are concentrated at the 1, 1.5 and 2 carat marks.
Let’s take a step back and try to understand the ggplot syntax.
The first thing we did was to define a graphics object, diamond.plot. This definition told R that we’re using the diamonds data, and that we want to display carat on the x-axis, and price on the y-axis.
We then called diamond.plot + geom_point() to get a scatterplot.
The arguments passed to aes() are called mappings. Mappings specify what variables are used for what purpose. When you use geom_point() in the second line, it pulls x, y, colour, size, etc., from the mappings specified in the ggplot() command.
You can also specify some arguments to geom_point directly if you want to specify them for each plot separately instead of pre-specifying a default.
Here we shrink the points to a smaller size, and use the alpha argument to make the points transparent.
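The call described here is not shown in this copy. A minimal sketch follows; the particular size and alpha values are chosen for illustration and are not taken from the notes.

```r
library(ggplot2)

diamond.plot <- ggplot(data = diamonds, aes(x = carat, y = price))

# size and alpha are set directly in geom_point(), outside aes(), so they
# apply to every point instead of being mapped from a variable
small.points <- diamond.plot + geom_point(size = 0.7, alpha = 0.3)
small.points
```

With more than 50,000 points, transparency makes the dense regions visible as darker areas instead of a solid blob.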
If we wanted to let point color depend on the color indicator of the diamond, we could do so in the following way.
diamond.plot <- ggplot(data=diamonds, aes(x=carat, y=price, colour = color))
diamond.plot + geom_point()
If we didn’t know anything about diamonds going in, this plot would indicate to us that D is likely the highest diamond grade, while J is the lowest grade.
We can change colors by specifying a different color palette. Here’s how we can switch to the cbPalette we saw last class.
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
diamond.plot <- ggplot(data=diamonds, aes(x=carat, y=price, colour = color))
diamond.plot + geom_point() + scale_colour_manual(values=cbPalette)
To make the scatterplot look more typical, we can switch to logarithmic coordinate axis spacing.
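The chunk for the logarithmic version is missing from this copy. One common way to get log-spaced axes is scale_x_log10() together with scale_y_log10(); this is a sketch and may not be the exact call used in the original notes.

```r
library(ggplot2)

diamond.plot <- ggplot(data = diamonds, aes(x = carat, y = price, colour = color))

# Both axes on a log10 scale; tick labels still show the original units
log.plot <- diamond.plot + geom_point() + scale_x_log10() + scale_y_log10()
log.plot
```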
Conditional plots
We can create plots showing the relationship between variables across different values of a factor. For instance, here’s a scatterplot showing how diamond price varies with carat size, conditioned on cut (the points are still colored by the color grade). It’s created using the facet_wrap(~ factor1 + factor2 + ... + factorn) command.
diamond.plot <- ggplot(data=diamonds, aes(x=carat, y=price, colour = color))
diamond.plot + geom_point() + facet_wrap(~ cut)
You can also use facet_grid() to produce this type of output.
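As a sketch of the facet_grid() variant: the formula has the form rows ~ columns, and a dot means "no faceting in this direction". The layout chosen here (one column per cut) is an illustration, not taken from the notes.

```r
library(ggplot2)

diamond.plot <- ggplot(data = diamonds, aes(x = carat, y = price, colour = color))

# One row of panels, one column per level of cut
grid.plot <- diamond.plot + geom_point() + facet_grid(. ~ cut)
grid.plot
```

Unlike facet_wrap(), facet_grid() never wraps panels onto a new row, which makes it the natural choice when conditioning on two factors at once, for example facet_grid(clarity ~ cut).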
ggplot can create a lot of different kinds of plots, just like lattice. Here are some examples.
| Function | Description |
|---|---|
| geom_point(...) | Points, i.e., scatterplot |
| geom_bar(...) | Bar chart |
| geom_line(...) | Line chart |
| geom_boxplot(...) | Boxplot |
| geom_violin(...) | Violin plot |
| geom_density(...) | Density plot with one variable |
| geom_density2d(...) | Density plot with two variables |
| geom_histogram(...) | Histogram |
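As a quick illustration of two of the geoms listed above, applied to the diamonds data: the choice of variables and the binwidth value are assumptions made for this sketch.

```r
library(ggplot2)

# Boxplot: distribution of price within each cut
box.plot <- ggplot(diamonds, aes(x = cut, y = price)) + geom_boxplot()

# Histogram: distribution of carat (binwidth picked for illustration)
hist.plot <- ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.1)

box.plot
hist.plot
```

Swapping the geom while keeping the ggplot() call fixed is often the fastest way to try several views of the same variables.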
Visualizing means
Previously we calculated the following table:
tbl.mean.bwt <- birthwt %>%
group_by(race, mother.smokes) %>%
  summarize(mean.birthwt = round(mean(birthwt.grams), 0))
`summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
tbl.mean.bwt
# A tibble: 6 x 3
# Groups: race [3]
race mother.smokes mean.birthwt
<fct> <fct> <dbl>
1 white no 3429
2 white yes 2827
3 black no 2854
4 black yes 2504
5 other no 2816
6 other yes 2757
We can plot this table in a nice bar chart as follows:
# Define basic aesthetic parameters
p.bwt <- ggplot(data = tbl.mean.bwt,
aes(y = mean.birthwt, x = race, fill = mother.smokes))
# Pick colors for the bars
bwt.colors <- c("#009E73", "#999999")
# Display barchart
p.bwt + geom_bar(stat = "identity", position = "dodge") +
ylab("Average birthweight") +
xlab("Mother's race") +
guides(fill = guide_legend(title = "Mother's smoking status")) +
  scale_fill_manual(values=bwt.colors)
Does the association between birthweight and mother’s age depend on smoking status?
We previously ran the following command to calculate the correlation between mother’s ages and baby birthweights broken down by the mother’s smoking status.
# A tibble: 2 x 2
mother.smokes cor_bwt_age
* <fct> <dbl>
1 no 0.201
2 yes -0.144
Here’s a visualization of our data that allows us to see what’s going on.
ggplot(birthwt,
aes(x=mother.age, y=birthwt.grams, shape=mother.smokes, color=mother.smokes)) +
geom_point() + # Adds points (scatterplot)
geom_smooth(method = "lm") + # Adds regression lines
ylab("Birth Weight (grams)") + # Changes y-axis label
xlab("Mother's Age (years)") + # Changes x-axis label
  ggtitle("Birth Weight by Mother's Age") # Changes plot title
`geom_smooth()` using formula 'y ~ x'
License
This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This worksheet is adapted from materials by Prof. Alexandra Chouldechova at CMU, released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.